@dichn
Contributor

@dichn dichn commented Aug 1, 2025

Purpose

  • Add calibrate_all_experts option to improve MoE calibration

Changes

  • Add calibrate_all_experts flag to MoE layers
  • Update replace_modules_for_calibration and moe_calibration_context
    to propagate the flag into modules
  • Modify expert forward passes:
    • Normal mode (default): compute output only for tokens routed to
      top-k experts, and combine their weighted results in the final
      output
    • Calibration mode (calibrate_all_experts=True): compute output for
      all tokens on every expert, but still apply the top-k gating to
      decide which token outputs contribute to the final result.
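
The two modes described above can be sketched with a dependency-free toy. This is only an illustration, not the PR's implementation: it uses plain Python lists instead of PyTorch tensors, and `top_k_routing`, `moe_forward`, and the scaling "experts" are hypothetical names. It also assumes the top-k weights are renormalised to sum to 1, which may differ between model families.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def top_k_routing(logits: Sequence[float], k: int) -> Dict[int, float]:
    """Map each of the k highest-scoring experts to a renormalised weight."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax numerators
    chosen = sorted(range(len(exps)), key=lambda i: exps[i], reverse=True)[:k]
    total = sum(exps[i] for i in chosen)
    return {i: exps[i] / total for i in chosen}


def moe_forward(
    tokens: Sequence[Vector],
    router_logits: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    k: int,
    calibrate_all_experts: bool = False,
) -> Tuple[List[Vector], List[int]]:
    """Return (per-token outputs, number of tokens each expert processed)."""
    routing = [top_k_routing(logits, k) for logits in router_logits]
    outputs = [[0.0] * len(token) for token in tokens]
    tokens_seen = [0] * len(experts)

    for e, expert in enumerate(experts):
        for t, token in enumerate(tokens):
            routed = e in routing[t]
            if not (routed or calibrate_all_experts):
                continue  # normal mode: this expert never sees the token
            expert_out = expert(token)  # what a calibration observer would record
            tokens_seen[e] += 1
            if routed:  # top-k gating still decides what reaches the output
                weight = routing[t][e]
                outputs[t] = [o + weight * x for o, x in zip(outputs[t], expert_out)]
    return outputs, tokens_seen


# Four toy experts that scale their input; the router always prefers experts 0 and 1.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
router_logits = [[2.0, 1.0, 0.0, -1.0]] * len(tokens)

normal_out, normal_seen = moe_forward(tokens, router_logits, experts, k=2)
calib_out, calib_seen = moe_forward(
    tokens, router_logits, experts, k=2, calibrate_all_experts=True
)
print(normal_seen)  # [3, 3, 0, 0]: experts 2 and 3 receive no tokens
print(calib_seen)   # [3, 3, 3, 3]: every expert receives every token
```

In this toy, both modes return the same final output. Only the set of tokens each expert processes changes, which is what matters for collecting calibration statistics on rarely routed experts.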

Testing

  • Add unit test to verify all experts are triggered during MoE calibration

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @dichn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of Mixture-of-Experts (MoE) model calibration by introducing configurable control over expert execution and output contribution. My changes allow for more precise calibration strategies, enabling scenarios where all experts are evaluated regardless of routing, or where expert activations are computed without affecting the final model output.

Highlights

  • New Configuration Class: I've introduced a new CalibrationConfig class in llmcompressor/modeling/config.py to centralize and manage parameters for Mixture-of-Experts (MoE) calibration. This class includes moe_calibrate_all_experts and moe_calibrate_gated_acts boolean flags, with validation to prevent unsupported configurations.
  • Conditional Expert Execution: I've implemented conditional expert execution logic within the MoE layers for DeepseekV3, Llama4, and Qwen3 models. This allows for flexible control during calibration: moe_calibrate_all_experts ensures all experts run their forward pass, while moe_calibrate_gated_acts determines if their outputs contribute to the final hidden state.
  • Integration with Model Preparation: The replace_modules_for_calibration and moe_calibration_context functions in llmcompressor/modeling/prepare.py have been updated to accept and pass the new calibration configuration. This ensures that the desired calibration behavior is applied when MoE modules are replaced or temporarily modified.
  • Unit Test Coverage: I've added new unit tests for DeepseekV3, Llama4, and Qwen3 MoE layers to verify that all experts are triggered when moe_calibrate_all_experts is enabled and moe_calibrate_gated_acts is disabled, ensuring the intended calibration behavior.

@github-actions
github-actions bot commented Aug 1, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces conditional expert calibration for Mixture-of-Experts (MoE) models, controlled by a new CalibrationConfig. The changes allow for more flexible calibration setups. A critical issue was identified where the moe_calibrate_gated_acts flag was not correctly implemented in the model forward passes, leading to incorrect behavior when set to False. A suggestion was also made to improve the clarity of an error message.

@dichn
Contributor Author

dichn commented Aug 1, 2025

@kylesayrs
The newly introduced tests/llmcompressor/modeling/test_calib_llama4.py is currently untested due to regional access restrictions on the llama4 repo (I have added a pytest.skip mark to it). Could you please help run the test on your end?

@dichn
Contributor Author

dichn commented Aug 2, 2025

Re-pushed to:

  • fix the unconditional final output addition
  • improve the calibration configuration error message
  • add a unit test for the scenario where the calibration configuration flag is False

Collaborator

@dsikka dsikka left a comment

Thank you for the PR! Do you mind listing how you’ve tested the updated examples?

@dichn
Contributor Author

dichn commented Aug 4, 2025

Re-pushed to add the missing Llama4ForConditionalGeneration in test_calib_llama4.py.

Do you mind listing how you’ve tested the updated examples?

  • For my test plan, I’ve verified the changes using the newly added unit tests. However, due to limited GPU capacity on my local development machine (my laptop), I haven't validated the patch against a full model.

Note on the skipped test test_calib_llama4.py:

  • This test is currently marked with @pytest.mark.skip because I haven't been able to verify it due to regional access restrictions to LLaMA 4 resources. I’ve asked Kyle to help run the test, and once it passes on his end, the skip mark can be safely removed.

CC: @dsikka @kylesayrs

@dichn dichn requested review from dsikka and kylesayrs August 4, 2025 13:17
Collaborator

@rahul-tuli rahul-tuli left a comment

Thank you for the PR, it looks great! I had a couple suggestions:

  • Documenting the flags in the class itself, so their purpose is clearer to future users
  • A few naming suggestions (nits)

I'm running the llama4 example at the moment, will update here if it passes!



class CalibrationConfig(BaseModel):
    moe_calibrate_all_experts: bool
Collaborator

Could we add more information in this config class around what these flags do for future readers, so it's clear which flag should be set for which mode?

I was thinking something like:

  | all_experts | gated_acts | Behavior                                                               |
  |-------------|------------|------------------------------------------------------------------------|
  | True        | True       | All experts run, routed experts contribute to output (current default) |
  | True        | False      | All experts run for calibration, but outputs ignored                   |
  | False       | True       | Only routed experts run and contribute (standard inference)            |
  | False       | False      | Invalid configuration (raises error)                                   |
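
The behaviour matrix suggested above could be captured directly in the class, roughly as follows. This sketch swaps in a standard-library dataclass for the pydantic `BaseModel` and `model_validator` that the PR uses, so that it runs without extra dependencies. `MoECalibrationFlags`, `describe`, and the error text are hypothetical, and the `True`/`True` defaults are an assumption taken from the "current default" note in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoECalibrationFlags:  # hypothetical name, not the class from the PR
    """Flag pair from the table: which experts run, and whose outputs count."""

    moe_calibrate_all_experts: bool = True  # assumed default, per the table
    moe_calibrate_gated_acts: bool = True   # assumed default, per the table

    def __post_init__(self) -> None:
        # The table marks False/False as invalid, so reject it up front.
        if not (self.moe_calibrate_all_experts or self.moe_calibrate_gated_acts):
            raise ValueError(
                "Invalid configuration: at least one of moe_calibrate_all_experts "
                "and moe_calibrate_gated_acts must be True."
            )

    def describe(self) -> str:
        """Return the table's behaviour description for this flag combination."""
        if self.moe_calibrate_all_experts and self.moe_calibrate_gated_acts:
            return "All experts run, routed experts contribute to output"
        if self.moe_calibrate_all_experts:
            return "All experts run for calibration, but outputs ignored"
        return "Only routed experts run and contribute (standard inference)"


print(MoECalibrationFlags().describe())
print(MoECalibrationFlags(moe_calibrate_gated_acts=False).describe())
```

Keeping the description next to the validation means the docstring, the error, and the table cannot drift apart silently.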

from pydantic import BaseModel, model_validator


class CalibrationConfig(BaseModel):
Collaborator

nit: What do you think about renaming to MoECalibrationConfig?


class CalibrationConfig(BaseModel):
    moe_calibrate_all_experts: bool
    moe_calibrate_gated_acts: bool
Collaborator

nit: Consider renaming to something like use_gated_outputs, since the current name suggests it's about "calibrating gated activations" but it actually controls whether expert outputs contribute to the final result.

@rahul-tuli
Collaborator

Update: The llama4 test failed for me locally with:

pytest tests/llmcompressor/modeling/test_calib_llama4.py
========================================================= test session starts ==========================================================
platform linux -- Python 3.10.12, pytest-8.4.1, pluggy-1.6.0
rootdir: /home/rahul/llm-compressor
configfile: pyproject.toml
plugins: rerunfailures-15.1, mock-3.14.1
collected 1 item                                                                                                                       

tests/llmcompressor/modeling/test_calib_llama4.py F                                                                              [100%]

=============================================================== FAILURES ===============================================================
_________________________ test_calib_replace_llama4_moe_all_experts[meta-llama/Llama-4-Scout-17B-16E-Instruct] _________________________

model_stub = 'meta-llama/Llama-4-Scout-17B-16E-Instruct'

    @pytest.mark.parametrize("model_stub", ["meta-llama/Llama-4-Scout-17B-16E-Instruct"])
    def test_calib_replace_llama4_moe_all_experts(model_stub):
        with skip_weights_download(Llama4ForConditionalGeneration):
            model = Llama4ForConditionalGeneration.from_pretrained(
                model_stub, torch_dtype="auto"
            )
    
        replace_modules_for_calibration(
            model, moe_calibrate_gated_acts=False, moe_calibrate_all_experts=True
        )
    
        # Find a Llama4 MoE layer
        moe_layer = None
>       for _, module in model.modules():
E       TypeError: cannot unpack non-iterable Llama4ForConditionalGeneration object

tests/llmcompressor/modeling/test_calib_llama4.py:25: TypeError
-------------------------------------------------------- Captured stdout setup ---------------------------------------------------------
2025-08-06T09:19:24.192277-0400 | reset | INFO - Compression lifecycle reset
--------------------------------------------------------- Captured stderr call ---------------------------------------------------------
Fetching 13 files: 100%|██████████| 13/13 [00:00<00:00, 21.27it/s]
The following layers were not sharded: vision_model.positional_embedding_vlm, vision_model.model.layers.*.self_attn.o_proj.weight, language_model.model.layers.*.input_layernorm.weight, language_model.model.layers.*.self_attn.q_proj.weight, vision_model.model.layers.*.mlp.fc*.bias, vision_model.model.layers.*.self_attn.v_proj.bias, vision_model.patch_embedding.linear.weight, vision_model.layernorm_pre.weight, language_model.model.layers.*.feed_forward.shared_expert.gate_proj.weight, vision_model.layernorm_pre.bias, vision_model.model.layers.*.self_attn.v_proj.weight, language_model.model.embed_tokens.weight, language_model.model.layers.*.feed_forward.shared_expert.up_proj.weight, vision_model.model.layers.*.self_attn.k_proj.weight, vision_model.model.layers.*.input_layernorm.weight, vision_model.model.layers.*.post_attention_layernorm.bias, vision_model.vision_adapter.mlp.fc*.weight, language_model.model.layers.*.feed_forward.shared_expert.down_proj.weight, vision_model.layernorm_post.bias, language_model.model.layers.*.feed_forward.router.weight, language_model.model.layers.*.feed_forward.experts.down_proj, vision_model.model.layers.*.self_attn.q_proj.bias, vision_model.model.layers.*.self_attn.o_proj.bias, vision_model.model.layers.*.input_layernorm.bias, vision_model.model.layers.*.self_attn.k_proj.bias, language_model.model.layers.*.post_attention_layernorm.weight, vision_model.model.layers.*.post_attention_layernorm.weight, vision_model.model.layers.*.mlp.fc*.weight, vision_model.layernorm_post.weight, vision_model.model.layers.*.self_attn.q_proj.weight, language_model.lm_head.weight, language_model.model.layers.*.self_attn.v_proj.weight, multi_modal_projector.linear_*.weight, language_model.model.layers.*.self_attn.o_proj.weight, vision_model.class_embedding, language_model.model.layers.*.self_attn.k_proj.weight, language_model.model.norm.weight, language_model.model.layers.*.feed_forward.experts.gate_up_proj
------------------------------------------------------- Captured stdout teardown -------------------------------------------------------
2025-08-06T09:23:08.745259-0400 | reset | INFO - Compression lifecycle reset
======================================================= short test summary info ========================================================
FAILED tests/llmcompressor/modeling/test_calib_llama4.py::test_calib_replace_llama4_moe_all_experts[meta-llama/Llama-4-Scout-17B-16E-Instruct] - TypeError: cannot unpack non-iterable Llama4ForConditionalGeneration object
==================================================== 1 failed in 234.66s (0:03:54) =====================================================

I will take a look soon!
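
The TypeError in the traceback above is consistent with how PyTorch's `Module.modules()` behaves: it yields module objects rather than `(name, module)` pairs, so unpacking into two names fails on the first item, which is the model itself. A minimal sketch of the failure and a likely fix with `named_modules()` follows, using a hypothetical `TinyModel` stand-in rather than the Llama 4 model from the test:

```python
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for the model loaded in the failing test."""

    def __init__(self) -> None:
        super().__init__()
        self.fc = nn.Linear(4, 4)
        self.act = nn.ReLU()


model = TinyModel()

# modules() yields bare module objects, so two-name unpacking raises TypeError.
unpack_error = None
try:
    for _, module in model.modules():
        pass
except TypeError as err:
    unpack_error = err

# named_modules() yields (name, module) pairs, which is what the loop expects.
names = [name for name, _ in model.named_modules()]

# If the name is not needed, iterate modules() without unpacking instead.
linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]

print(type(unpack_error).__name__)
print(names)
```

Either form should work for locating the MoE layer in the test. Which one the author ultimately used is not shown in this thread.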

Collaborator

@dsikka dsikka left a comment

How will the user configure these arguments?

Currently, for qwen3, we pass in calibrate_moe_context=True for nvfp4:
https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/qwen_30b_a3b.py#L70

This allows us to temporarily update the moe blocks with the blocks defined in modeling/qwen3_moe.py - is the plan to keep this argument?

@dichn
Contributor Author

dichn commented Aug 7, 2025

As noted by @kylesayrs, this PR aligns with a spec change that drops moe_calibrate_gated_acts in favor of supporting only moe_calibrate_all_experts, simplifying the implementation.
Marking this PR as a draft again.
CC: @rahul-tuli @dsikka (Thank you for your review.)

@dichn dichn marked this pull request as draft August 7, 2025 08:08
@dichn
Contributor Author

dichn commented Aug 18, 2025

Re-pushed to drop moe_calibrate_gated_acts in favor of supporting only calibrate_all_experts.
Also updated the PR description.

@dichn dichn marked this pull request as ready for review August 19, 2025 08:25
@dichn dichn marked this pull request as draft August 21, 2025 08:26
@dsikka dsikka closed this Sep 18, 2025